Investigations on large-scale lightly-supervised training for statistical machine translation

نویسنده

  • Holger Schwenk
چکیده

Sentence-aligned bilingual texts are a crucial resource to build statistical machine translation (SMT) systems. In this paper we propose to apply lightly-supervised training to produce additional parallel data. The idea is to translate large amounts of monolingual data (up to 275M words) with an SMT system, and to use those as additional training data. Results are reported for the translation from French into English. We consider two setups: first the intial SMT system is only trained with a very limited amount of human-produced translations, and then the case where we have more than 100 million words. In both conditions, lightly-supervised training achieves significant improvements of the BLEU score.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Lightly-Supervised Training for Hierarchical Phrase-Based Machine Translation

In this paper we apply lightly-supervised training to a hierarchical phrase-based statistical machine translation system. We employ bitexts that have been built by automatically translating large amounts of monolingual data as additional parallel training corpora. We explore different ways of using this additional data to improve our system. Our results show that integrating a second translatio...

متن کامل

Pivot Lightly-Supervised Training for Statistical Machine Translation

In this paper, we investigate large-scale lightly-supervised training with a pivot language: We augment a baseline statistical machine translation (SMT) system that has been trained on human-generated parallel training corpora with large amounts of additional unsupervised parallel data; but instead of creating this synthetic data from monolingual source language data with the baseline system it...

متن کامل

Translation Model Adaptation for an Arabic/French News Translation System by Lightly-Supervised Training

Most of the existing, easily available parallel texts to train a statistical machine translation system are from international organizations that use a particular jargon. In this paper, we consider the automatic adaptation of such a translation model to the news domain. The initial system was trained on more than 200M words of UN bitexts. We then explore large amounts of in-domain monolingual t...

متن کامل

Semi-Supervised Training for Statistical Word Alignment

We introduce a semi-supervised approach to training for statistical machine translation that alternates the traditional Expectation Maximization step that is applied on a large training corpus with a discriminative step aimed at increasing word-alignment quality on a small, manually word-aligned sub-corpus. We show that our algorithm leads not only to improved alignments but also to machine tra...

متن کامل

Learning speech translation from interpretation

Globalization spurs the need for cross-lingual verbal communication. This is reflected in ongoing research in the field of speech translation. However, the development of deployable speech translation systems still happens only for a handful of languages. Prohibitively high costs attached to the acquisition of sufficient amounts of suitable training data are one of the main reasons for this sit...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008